{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Web scraping Yahoo Finance\n",
    "Yahoo Finance provides a tremendous amount of financial data related to a wide variety of financial instruments, in this tutorial, we will cover how to web scrape Yahoo Finance for ETF & mutual fund data. This tutorial will require two libraries, `requests` for requesting the URL, and `bs4` for parsing the HTML content of our request.\n",
    "\n",
    "We will define our `base_url` to be **https://finance.yahoo.com**, which is the main page. To query a particular instrument, we pass through the ticker symbol of our instrument. For example, if I want to query the `QQQ` ETF, then we would construct the following URL **https://finance.yahoo.com/quote/QQQ**. Once, we construct our URL, we will request the content, pass the material through to our Beautiful Soup object, and then begin parsing the content.\n",
    "\n",
    "The main landing page for any ETF or Mutual fund contains items of interest for our analysis, the first being all the links to the additional data and a summary table describing our instrument. First, let's grab both the left and right side of that summary table. The summary appears as the following:\n",
    "\n",
    "<img src=\"summary_table.jpg\" width=\"500\">\n",
    "\n",
    "With the summary table captured, let's move on to the nav menu, this will contain all the other links to the different data sources. The nav menu appears as the following:\n",
    "\n",
    "<img src=\"nav_menu.jpg\" width=\"400\">\n",
    "\n",
    "From here, loop through all the `a` tags, grab the corresponding `href` attribute and store it in a dictionary we create above the loop."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'Summary': 'https://finance.yahoo.com/quote/ARKQ?p=ARKQ',\n",
       " 'Historical Data': 'https://finance.yahoo.com/quote/ARKQ/history?p=ARKQ',\n",
       " 'Profile': 'https://finance.yahoo.com/quote/ARKQ/profile?p=ARKQ',\n",
       " 'Options': 'https://finance.yahoo.com/quote/ARKQ/options?p=ARKQ',\n",
       " 'Holdings': 'https://finance.yahoo.com/quote/ARKQ/holdings?p=ARKQ',\n",
       " 'Performance': 'https://finance.yahoo.com/quote/ARKQ/performance?p=ARKQ',\n",
       " 'Risk': 'https://finance.yahoo.com/quote/ARKQ/risk?p=ARKQ'}"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import requests\n",
    "from bs4 import BeautifulSoup\n",
    "import numpy as np\n",
    "\n",
    "# define a symbol to search\n",
    "stock_symbol = \"ARKQ\"\n",
    "\n",
    "# base url\n",
    "base = \"https://finance.yahoo.com\"\n",
    "\n",
    "# url to particular stock\n",
    "endpoint = base + \"/quote/{}\".format(stock_symbol)\n",
    "\n",
    "# request it & parse it\n",
    "response = requests.get(endpoint).content\n",
    "soup = BeautifulSoup(response, 'html.parser')\n",
    "\n",
    "# these two sections contain the information for the summary page, so grab them to be parsed later.\n",
    "left_summary_table = soup.find_all('div', {'data-test':'left-summary-table'})\n",
    "right_summary_table = soup.find_all('div', {'data-test':'right-summary-table'})\n",
    "\n",
    "# find the nav menu to get the other page links\n",
    "nav_menu = soup.find('div', {'id':'quote-nav'})\n",
    "\n",
    "# store the links in a dictionary\n",
    "link_dictionary = {}\n",
    "\n",
    "for anchor in nav_menu.find_all('a'):\n",
    "    \n",
    "    # grab the text (Page Name), link to page, and store the full link in the dictionary.\n",
    "    text = anchor.text\n",
    "    full_link = base + anchor['href']        \n",
    "    link_dictionary[text] = full_link\n",
    "\n",
    "    \n",
    "link_dictionary"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If we split our list as we collect them, it will make storing the data more consistent. Let's define a function that takes two parameters, a list and the number of items we want in our chunk."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "# build our split list function\n",
    "def split_list(my_list, chunks):    \n",
    "    return [my_list[i:i + chunks] for i in range(0, len(my_list), chunks)]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's parse the summary page; this will be the easiest because it's stored in a table. Grab all the columns from the table using `find_all('td')`. From here, we can use list comprehension to store all the data, and then add that list to our `major_list`. This will make it easier to break the list into chunks."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 54,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[['Previous Close', '32.85'],\n",
       " ['Open', '32.88'],\n",
       " ['Bid', ['33.16', '1200']],\n",
       " ['Ask', ['33.21', '1300']],\n",
       " [\"Day's Range\", ['32.78', '33.18']],\n",
       " ['52 Week Range', ['28.29', '37.47']],\n",
       " ['Volume', 14754.0],\n",
       " ['Avg. Volume', 22214.0],\n",
       " ['Net Assets', '176.59M'],\n",
       " ['NAV', '32.81'],\n",
       " ['PE Ratio (TTM)', nan],\n",
       " ['Yield', 0.0],\n",
       " ['YTD Return', 0.08199999999999999],\n",
       " ['Beta (3Y Monthly)', '1.44'],\n",
       " ['Expense Ratio (net)', 0.0075],\n",
       " ['Inception Date', ['2014-09-30']]]"
      ]
     },
     "execution_count": 54,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# SECTION ONE - PARSE THE SUMMARY PAGE\n",
    "\n",
    "# grab the `tbody` for both the left and right side.\n",
    "tbody_left = left_summary_table[0].tbody\n",
    "tbody_right = right_summary_table[0].tbody\n",
    "\n",
    "# define a list to store both tables\n",
    "major_list = []\n",
    "\n",
    "# append the parsed table to the master list.\n",
    "major_list.append([item.text for item in tbody_left.find_all('td')])\n",
    "major_list.append([item.text for item in tbody_right.find_all('td')])\n",
    "\n",
    "# create a chunked version of our master list.\n",
    "summary_data = [chunk for item in major_list for chunk in split_list(item, 2)]\n",
    "summary_data\n",
    "\n",
    "# make it number friendly\n",
    "for row in summary_data:\n",
    "\n",
    "    # handle the precentage case\n",
    "    if '%' in row[1]:        \n",
    "        row[1] = float(row[1].replace('%',''))/100\n",
    "    \n",
    "    # handle the split case X\n",
    "    elif 'x' in row[1]:\n",
    "        row[1] = row[1].split(' x ')\n",
    "    \n",
    "    # handle the split case -\n",
    "    elif '-' in row[1]:\n",
    "        row[1] = row[1].split(' - ')\n",
    "    \n",
    "    # handle the ,\n",
    "    elif ',' in row[1]:\n",
    "        row[1] = float(row[1].replace(',',''))\n",
    "     \n",
    "    # handle missing values\n",
    "    elif 'N/A' in row[1]:\n",
    "        row[1] = np.nan\n",
    "        \n",
    "\n",
    "        \n",
    "summary_data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The Holdings page will contain the makeup of the fund; this will be valuable when it comes to determining exposure to different sectors. It appears as the following:\n",
    "\n",
    "<img src=\"holdings_table.jpg\" width=\"500\">"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 55,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[['Stocks', 0.9994],\n",
       " ['Bonds', 0.0],\n",
       " ['Basic Materials', 0.0],\n",
       " ['CONSUMER_CYCLICAL', 0.2004],\n",
       " ['Financial Services', 0.0],\n",
       " ['Realestate', 0.0],\n",
       " ['Consumer Defensive', 0.0],\n",
       " ['Healthcare', 0.056299999999999996],\n",
       " ['Utilities', 0.0],\n",
       " ['Communication Services', 0.0],\n",
       " ['Energy', 0.0],\n",
       " ['Industrials', 0.10769999999999999],\n",
       " ['Technology', 0.6355],\n",
       " ['Price/Earnings', '28.93'],\n",
       " ['Price/Book', '3.6'],\n",
       " ['Price/Sales', '2.78'],\n",
       " ['Price/Cashflow', '17.15'],\n",
       " ['Median Market Cap', nan],\n",
       " ['3 Year Earnings Growth', nan],\n",
       " ['US Goverment', 0.0],\n",
       " ['AAA', 0.0],\n",
       " ['AA', 0.0],\n",
       " ['A', 0.0],\n",
       " ['BBB', 0.0],\n",
       " ['BB', 0.0],\n",
       " ['B', 0.0],\n",
       " ['Below B', 0.0],\n",
       " ['Others', 0.0]]"
      ]
     },
     "execution_count": 55,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# SECTION TWO - PARSE THE HOLDING PAGE\n",
    "\n",
    "# define the link to the page\n",
    "link = link_dictionary['Holdings']\n",
    "\n",
    "# request the link and dump the content into the parser.\n",
    "response = requests.get(link)\n",
    "soup = BeautifulSoup(response.content, 'html.parser')\n",
    "\n",
    "# We have to define a list of items we don't want. Luckily there are only a few items we need to avoid.\n",
    "skip_list = ['',stock_symbol,'Sector','Average']\n",
    "\n",
    "# find all the span elements labeled with start and end.\n",
    "items = soup.find_all('span', {'class':['Fl(start)','Fl(end)']})\n",
    "\n",
    "# loop through the parsed content, grab the text, and make sure it's not in the skip_list.\n",
    "unsplit_data = [item.text for item in items if item.text not in skip_list ]\n",
    "\n",
    "# split the list into chunks\n",
    "holdings_data = [chunk for chunk in split_list(unsplit_data, 2)]  \n",
    "\n",
    "# make number friendly\n",
    "for row in holdings_data:\n",
    "    \n",
    "    if '%' in row[1]:        \n",
    "        row[1] = float(row[1].replace('%',''))/100\n",
    "        \n",
    "    elif 'N/A' in row[1]:\n",
    "        row[1] = np.nan\n",
    "        \n",
    "holdings_data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The Profile page contains the fund overview and fund operations table it will appear as the following:\n",
    "\n",
    "<img src=\"fund_table.jpg\" width = \"300\">"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 61,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[['Category', 'Technology'], ['Fund Family', 'ARK ETF Trust'], ['Net Assets', '176.59M'], ['YTD Return', '8.20%'], ['Yield', '0.00%'], ['Legal Type', 'Exchange Traded Fund'], ['Annual Report Expense Ratio (net)', 0.0075, 0.0053], ['Holdings Turnover', 0.57, 32.42], ['Total Net Assets', 16488.12, 16488.12]]\n"
     ]
    }
   ],
   "source": [
    "# SECTION THREE - PROFILE PAGE\n",
    "\n",
    "# define the link\n",
    "link = link_dictionary['Profile']\n",
    "\n",
    "# request the link and dump the content into the parser.\n",
    "response = requests.get(link)\n",
    "soup = BeautifulSoup(response.content, 'html.parser')\n",
    "\n",
    "# define a master and mini list\n",
    "profile_list = []\n",
    "mini_list = []\n",
    "\n",
    "# items to skip over.\n",
    "skip_list = ['',stock_symbol,'Attributes','Category Average']\n",
    "\n",
    "multi_section = ['Annual Report Expense Ratio (net)','Holdings Turnover','Total Net Assets']\n",
    "\n",
    "# find all the start and end spans.\n",
    "for item in soup.find_all('span', {'class':['Fl(start)','Fl(end)']}):\n",
    "    \n",
    "    # if it's not in the skip list, append it.\n",
    "    if item.text not in skip_list:\n",
    "        \n",
    "        # handle the 3 item section\n",
    "        if 'Ta(s)' in item['class'] or 'Ta(e)' in item['class']:\n",
    "\n",
    "            # make number friendly\n",
    "            if '%' in item.text:        \n",
    "                item =  float(item.text.replace('%','').replace(',',''))/100\n",
    "            elif ',' in item.text:        \n",
    "                item =  float(item.text.replace(',',''))\n",
    "            else:\n",
    "                item = item.text\n",
    "                \n",
    "            mini_list.append(item)\n",
    "\n",
    "            # once you have two items in the mini_list append to the master list.\n",
    "            if len(mini_list) == 3:            \n",
    "                profile_list.append(mini_list)\n",
    "                mini_list = []  \n",
    "                \n",
    "                \n",
    "        # handle the 2 item section\n",
    "        else:\n",
    "            mini_list.append(item.text)\n",
    "\n",
    "            # once you have two items in the mini_list append to the master list.\n",
    "            if len(mini_list) == 2:            \n",
    "                profile_list.append(mini_list)\n",
    "                mini_list = []\n",
    "\n",
    "print(profile_list)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The Risk page contains all the risk metrics we would calculate over different timelines. Here is how the table will appear:\n",
    "\n",
    "\n",
    "<img src=\"risk_table.jpg\" width = \"500\">"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 64,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[['Category', '3-Year Fund', '3-Year Category', '5-Year Fund', '5-Year Category', '10-Year Fund', '10-Year Category'], ['Alpha', 7.71, 9.46, 0.0, 5.86, 0.0, 5.19], ['Beta', 1.44, 1.1, 0.0, 1.05, 0.0, 1.04], ['Mean Annual Return', 1.95, 1.13, 0.0, 1.22, 0.0, 0.88], ['R-squared', 67.4, 63.2, 0.0, 62.13, 0.0, 72.67], ['Standard Deviation', 17.72, 15.69, 0.0, 15.41, 0.0, 20.82], ['Sharpe Ratio', 1.24, 0.86, 0.0, 0.95, 0.0, 0.48], ['Treynor Ratio', 15.87, 11.9, 0.0, 13.74, 0.0, 7.7]]\n"
     ]
    }
   ],
   "source": [
    "# SECTION THREE - RISK PAGE\n",
    "\n",
    "# define the link\n",
    "link = link_dictionary['Risk']\n",
    "\n",
    "# request the link and dump the content into the parser.\n",
    "response = requests.get(link)\n",
    "soup = BeautifulSoup(response.content, 'html.parser')\n",
    "\n",
    "# define a master and mini list\n",
    "risk_list = []\n",
    "mini_list = []\n",
    "\n",
    "# define the first row as the key row.\n",
    "risk_list.append(['Category','3-Year Fund','3-Year Category',\n",
    "                              '5-Year Fund','5-Year Category',\n",
    "                              '10-Year Fund','10-Year Category'])\n",
    "\n",
    "# items to skip\n",
    "list_of_nos = ['',stock_symbol,'Sector','Average']\n",
    "    \n",
    "# find the table\n",
    "for row in soup.find_all('div', class_= r\"Bdbw(1px) Bdbc($screenerBorderGray) Bdbs(s) H(25px) Pt(10px)\"):\n",
    "\n",
    "    # grab the rows\n",
    "    for item in row.find_all('span'):\n",
    "        \n",
    "        # if it's not in the skip list, append it.\n",
    "        if item.text not in list_of_nos:\n",
    "            \n",
    "            try:\n",
    "                mini_list.append(float(item.text))\n",
    "            except:\n",
    "                mini_list.append(item.text)\n",
    "\n",
    "            # chunk it\n",
    "            if len(mini_list) == 7:            \n",
    "                risk_list.append(mini_list)\n",
    "                mini_list = []\n",
    "\n",
    "print(risk_list)  "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The performance page, contains all the different return metrics and return benchmarks, it will appear as the following:\n",
    "\n",
    "<img src=\"performance_table.jpg\" width =\"500\">"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 66,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[[0.08199999999999999, 'Year-to-Date Return (Mkt)'], [-0.05, '1-Year Total Return (Mkt)'], [0.22769999999999999, '3-Years Total Return (Mkt)'], [0.17379999999999998, 0.1118, 0.0169, '1-Month'], [-0.0288, 0.0408, 0.031, '3-Month'], [0.0928, 0.10769999999999999, 0.24109999999999998, '3-Year'], [0.1317, 0.0, 0.1436, '5-Year'], [0.0, 0.0868, 0.0, 'Last Bull Market'], [0.0, 0.0, 0.0, 'Last Bear Market'], ['2019', nan, nan], ['2018', -0.0757, nan], ['2017', 0.5241, nan], ['2016', 0.14730000000000001, nan], ['2015', -0.0227, 0.045], ['2014', nan, 0.1422]]\n"
     ]
    }
   ],
   "source": [
    "# SECTION FOUR - PERFORMANCE PAGE\n",
    "\n",
    "# define the link\n",
    "link = link_dictionary['Performance']\n",
    "\n",
    "# request the link and dump the content into the parser.\n",
    "response = requests.get(link)\n",
    "soup = BeautifulSoup(response.content, 'html.parser')\n",
    "\n",
    "# define a major and minor list.\n",
    "performance_list = []\n",
    "mini_list = []\n",
    "\n",
    "# define items to skip\n",
    "skip_list = ['',stock_symbol,'Sector','Average','Performance Overview', 'Return','Category']\n",
    "\n",
    "# find all three sections, and label them for iteration.    \n",
    "for index, row in enumerate(soup.find_all('div', class_= r\"Mb(25px)\")):\n",
    "\n",
    "    # PERFORMANCE OVERVIEW table.\n",
    "    if index == 0: \n",
    "        \n",
    "        # find all the rows.\n",
    "        for item in row.find_all('span', class_=True):\n",
    "            \n",
    "            # grab the Metric\n",
    "            if item.text not in skip_list:\n",
    "                \n",
    "                if '%' in item.text:        \n",
    "                    item =  float(item.text.replace('%',''))/100\n",
    "                else:\n",
    "                    item = item.text\n",
    "                    \n",
    "                mini_list.append(item)\n",
    "\n",
    "                # chunk it.\n",
    "                if len(mini_list) == 2:            \n",
    "                    performance_list.append(mini_list)\n",
    "                    mini_list = []\n",
    "         \n",
    "    # TRAILING RETURNS AND BENCHMARK table.\n",
    "    elif index == 1: \n",
    "        \n",
    "        # find all the rows.\n",
    "        for item in row.find_all('span', class_=True):            \n",
    "                        \n",
    "            # if it is a metric row  header, nested span tag. Define a row header key\n",
    "            if item.span != None:\n",
    "                cat = item.span.text\n",
    "            \n",
    "            # if it's not a metric row header.\n",
    "            if item.text not in skip_list and item.span == None:                \n",
    "\n",
    "                mini_list.append(float(item.text.replace('%',''))/100)\n",
    "                \n",
    "                if len(mini_list) == 3: \n",
    "                    mini_list.append(cat)\n",
    "                    performance_list.append(mini_list)\n",
    "                    mini_list = []\n",
    "               \n",
    "    # ANNUAL RETURN HISTORY table\n",
    "    elif index == 2:  \n",
    "        \n",
    "        # grab the rows.\n",
    "        for item in row.find_all('span', class_=True):\n",
    "            \n",
    "            # grab the metric\n",
    "            if item.text not in skip_list and item.span == None: \n",
    "                \n",
    "                # make number friendly\n",
    "                if '%' in item.text:        \n",
    "                    item =  float(item.text.replace('%',''))/100\n",
    "\n",
    "                elif 'N/A' in item.text:\n",
    "                    item = np.nan\n",
    "                    \n",
    "                else:\n",
    "                    item = item.text\n",
    "                    \n",
    "                mini_list.append(item)\n",
    "                \n",
    "                # chunk it.\n",
    "                if len(mini_list) == 3: \n",
    "                    performance_list.append(mini_list)\n",
    "                    mini_list = []\n",
    "        \n",
    "\n",
    "print(performance_list)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
